ABSTRACT

A new approach to assessing unexpected differential 
item performance (item bias or item fairness) was developed and 
applied to the item responses of males and females to Scholastic 
Aptitude Test and Test of Standard Written English items administered 
operationally in December 1977. While the main body of the report 
describes the particulars of the present application and delineates 
the essential features of the approach, a technical appendix 
describes the standardization approach in detail. The primary goal of 
the standardization approach is to control for differences in 
subpopulation ability before making comparisons between subpopulation 
performance on test items. By so doing, it removes the contaminating 
effects of ability differences from the assessment of item fairness. 
Of the total of 195 items studied, the standardization approach 
identified only a handful as meriting careful review for possible 
content bias. Of these few, only one item exhibited a clearly 
unacceptable degree of unexpected differential item performance 
between males and females that could be attributed to content bias. 
(Author) 
Abstract

A new approach to assessing unexpected differential item performance (item
bias or item fairness) is developed and applied to the item responses of males
and females to SAT/TSWE items administered operationally in December 1977.
While the main body of the report describes the particulars of the present
application and delineates the essential features of the approach, a technical
appendix describes the standardization approach in detail. The primary goal of
the standardization approach is to control for differences in subpopulation
ability before making comparisons between subpopulation performance on test
items. By so doing, it removes the contaminating effects of ability differences
from the assessment of item fairness. Of the total of 195 items studied, the
standardization approach identified only a handful as meriting careful review
for possible content bias. Of these few, only one item exhibited a clearly
unacceptable degree of unexpected differential item performance between males
and females that could be attributed to content bias.



ASSESSING UNEXPECTED DIFFERENTIAL ITEM PERFORMANCE 
OF FEMALE CANDIDATES ON SAT AND TSWE FORMS 
ADMINISTERED IN DECEMBER 1977: 
AN APPLICATION OF THE STANDARDIZATION APPROACH 

Those who develop and review the Scholastic Aptitude Test (SAT) are 
aware of the diversity of the test-taking population and attempt to construct 
tests based on a broad sampling of tasks and topics that tend not to favor any 
subgroup of the population. Donlon (1981) discussed the checks that are performed 
on the SAT to guard against favoritism towards any subgroup. In that article, 
Donlon summarized procedures used in the test development process to ensure that
items or test questions are appropriate for various subgroups, as well as the
types of statistical checks performed to evaluate item appropriateness.

Carlton and Marco (1982), in a review of methods used at Educational 
Testing Service to detect and eliminate possible favoritism in items, discussed 
several studies that have examined performance on SAT items across different 
subpopulations. Included in their review were six studies that were conducted
to monitor differential item performance of various groups on several forms of
the SAT and its companion test, the Test of Standard Written English (TSWE).
The purposes of this monitoring are:

(1) to ensure that the SAT and TSWE remain appropriate over time for major
subgroups of the SAT candidate population, and

(2) to identify possible content factors related to differential item
performance that would help test developers construct fair tests.

Dorans (1982) reviewed the five of those six studies that examined Black/ 
White candidate performance on SAT/TSWE items from forms of the SAT/TSWE that 
have the current content and format specifications. In the present report, 
the statistical method of standardization is used to examine whether there are
unexpected differences in item performance across different subpopulations of
the Scholastic Aptitude Test test-taking population.

Unexpected Differential Item Performance 

Unexpected differential item performance exists when there are differences 
in item performance that cannot be accounted for by differences in subgroup 
ability. An item is exhibiting unexpected differential item performance when 
the expected performance on the item is lower for examinees from one group 
than for examinees of equal ability from another group or other groups. If
we let S represent ability as measured by total score¹ on the standard
College Board 200-to-800 SAT scale (or on the 20-to-60 TSWE scale), and X
represent an item score (1 if the answer to the question is correct and 0 if the
answer is incorrect), then an item is free of unexpected differential item
performance when it satisfies the following equality

$$P_g(X=1 \mid S) = P_{g'}(X=1 \mid S) \quad \text{for all subpopulations } g \text{ and } g', \tag{1}$$

where $P_g(X=1 \mid S)$ is defined as the probability that candidates from
subpopulation g who have total test scores equal to S will answer the item
correctly. For example, if male and female candidates with the same total test
scores do not have equal probabilities of successful performance on an item,
this difference is taken as evidence of unexpected differential item
performance for male and female candidates at that particular score level. Note
that lack of unexpected differential item performance does not imply that there
are no differences in item performance across subgroups of the Scholastic
Aptitude Test candidate population. Unexpected differential item performance
does not refer to differences in overall subgroup performance on an item but
rather to differences in conditional item performance, where the requisite
condition before comparison is identical total test score.

¹ It is recognized that use of reported scaled score as the control variable can
be criticized because it is not a perfect measure of ability and because it is
an internal criterion, i.e., performance on an item is related to total score
performance in part because that item went into the determination of total
score. Nonetheless, reported scaled score is probably the best control variable
available for studies of unexpected differential item performance.
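
As an illustration of the definition above, the conditional probability for a group at a given score level can be estimated by the proportion of that group's candidates at that level who answer the item correctly. The following Python sketch uses hypothetical toy data and a hypothetical helper name, `conditional_p_correct`; it is not the report's own implementation.

```python
from collections import defaultdict


def conditional_p_correct(scores, item_scores):
    """Map each observed score level to the proportion answering correctly."""
    n_at_level = defaultdict(int)   # candidates at each score level
    n_correct = defaultdict(int)    # of those, how many answered correctly
    for s, x in zip(scores, item_scores):
        n_at_level[s] += 1
        n_correct[s] += x
    return {s: n_correct[s] / n_at_level[s] for s in n_at_level}


# Hypothetical toy data, not taken from the report.
male_scores, male_items = [400, 400, 400, 400, 500, 500], [1, 0, 1, 1, 1, 1]
female_scores, female_items = [400, 400, 500, 500], [1, 0, 1, 0]

p_male = conditional_p_correct(male_scores, male_items)
p_female = conditional_p_correct(female_scores, female_items)

for s in sorted(p_male.keys() & p_female.keys()):
    # Equation (1) holds at level s only if the two proportions agree.
    print(s, p_female[s], p_male[s], p_female[s] == p_male[s])
```

With real data, sampling noise means the two proportions will rarely agree exactly, which is why the report goes on to summarize the differences rather than test equality level by level.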

Several methods have been suggested for identifying unexpected differential
item performance, or item bias as it is frequently referred to in the literature.
The handbook by Berk (1982) attests to this fact. For a single comprehensive
review of the more popular methods, including the transformed item difficulty or
delta-plot method, item response theory methods, and chi-square approaches, see
Shepard, Camilli and Averill (1981). Most of these methods, however, have
exhibited undesirable sensitivities to differences in overall subpopulation
ability or differences in item quality (discrimination). Two of these methods
(transformed item difficulty and a chi-square approach) were employed in earlier
studies of the Scholastic Aptitude Test that were reviewed by Dorans (1982).
Both methods are subject to misclassifying items as unfair towards a particular
subgroup because of methodological sensitivities to differences in subpopulation
ability. The methodology employed in the current study controls for differences
in subpopulation ability through the statistical method of standardization.

Standardization is a technical term that, unfortunately, has more than one
meaning. In one usage, standardization typically refers to a numerical
operation which transforms a set of numbers with a particular mean (average score)
and standard deviation (spread of scores about the average score) to a set of
numbers that has a certain "standard" mean and standard deviation. This is not
the meaning of standardization as used in this report.

Rather, we shall use standardization to mean that one variable is
standardized with respect to some other variable before making comparisons between
groups. This type of standardization enables one to control for differences in
subpopulation ability while making comparisons of the performance of these
subpopulations on items. The procedures used in this study require a very large
data base in order to ensure the stability of the conditional probabilities
obtained at each score level in each subpopulation under investigation.
Fortunately, there are large data bases available for the Scholastic Aptitude Test.
Other methods of standardization may be used with smaller sample sizes, e.g.,
Alderman and Holland (1981). A general approach to assessing unexpected
differences in item performance via standardization is described in detail in
the appendix, where a mathematical formulation is presented and the method's
similarities to and differences from the item response theory approach are
discussed.
Standardization

In this section, the essential features of standardization are described.
The conditional probability of successful performance on an item, $P_g(X=1 \mid S)$,
is the raw datum for the standardization method. For each score level S, there
is a conditional probability of successful performance. Studies of unexpected
differential item performance focus on differences in conditional probability of
successful item performance between a study group and a base group. In this
first study, female SAT candidates are the study group, while male candidates
are the base group.

Figure 1 contains plots of the conditional probability of successful
performance for both males and females on an analogy item appearing on Form
ZSA5. Male conditional percent corrects are denoted by squares (□) at each 
score level, while female conditional percent corrects are denoted by asterisks 
(*). (Note that there are no asterisks at scaled scores of 770 and 800, which 
indicates there were no females at those two scaled score levels.) In this 
particular figure, the asterisks and squares tend to lie on top of one another. 
This consistent and high degree of overlap is evident in Figure 2, which is a 
plot of differences in conditional probabilities for this item. Note that 
almost all the asterisks in Figure 2 lie very close to the line of zero differ- 
ence. This particular analogy item exhibits very little unexpected differential 
item performance. 

The analogy item portrayed in Figures 3 and 4 serves as a striking
contrast to that depicted in Figures 1 and 2. Here, the squares (males) are higher
than the asterisks (females) at almost every scaled score level. In fact,
between scaled scores of 250 and 500, the difference between the female
conditional probabilities and the male conditional probabilities tends to be .2,
i.e., the probability that a male with a given scaled score in that range will
answer that analogy item correctly exceeds the probability that a female with
the same exact scaled score will answer the item correctly by the substantial
amount of .2. Clearly, this particular item exhibits a substantial amount of
unexpected differential item performance.

[Figure: Conditional Probability of Successful Item Performance for Both Males
and Females on Two Verbal Items from SAT Form ZSA5]

[Figure: Difference Plots of Two Verbal Items from SAT Form ZSA5]

Examination of conditional probability plots such as those depicted in
Figures 1 and 3 and difference plots like those in Figures 2 and 4 enables
one to look for evidence of unexpected differential item performance at fixed
score levels. In effect, the plots allow one to control for ability before
comparing item performance across subpopulations. Consequently, for each item
there is potential for unexpected differential item performance at each score
level that can be summarized via some numerical index. One such index is the
difference in conditional probabilities of successful performance at that score
level. If there are 61 observed score levels, such as there are on the College
Board SAT scale that ranges from 200 to 800 in steps of 10, then there are 61
such differences for each item. Clearly there exists a need for an economical
summary of these differences. Standardization provides that summary.
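
To make the count of 61 score levels concrete, the following Python sketch builds the 200-to-800 scale in steps of 10 and takes the study-minus-base difference at every level where both groups have candidates. The probabilities are hypothetical, and skipping levels with no study-group candidates is an assumption suggested by the remark that no females scored 770 or 800 on the first analogy item.

```python
SCORE_LEVELS = list(range(200, 801, 10))  # 200, 210, ..., 800


def conditional_differences(p_study, p_base):
    """Study-minus-base difference at each level observed in both groups."""
    return {
        s: p_study[s] - p_base[s]
        for s in SCORE_LEVELS
        if s in p_study and s in p_base
    }


# Hypothetical conditional probabilities; the study group has nobody at 800.
p_base = {s: 0.5 for s in SCORE_LEVELS}
p_study = {s: 0.25 for s in SCORE_LEVELS if s != 800}

diffs = conditional_differences(p_study, p_base)
print(len(SCORE_LEVELS), len(diffs), diffs[500])
```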

The application of the standardization procedure, in which the marginal
ability distribution of the female standardization group serves as a weighting
function, yields several summary indices of item performance. First, there is
the observed percent correct for the female study group, $P_f$, obtained by taking
a weighted sum of the 61 conditional probabilities of successful performance
observed in the female study group, where the relative frequencies at each
of the 61 scaled score levels in the female study group serve as the weights.
These same weights are applied to the 61 conditional probabilities observed in
the male base group to produce an index of expected item performance for the
female study group, $\hat{P}_f$. The difference between $P_f$ and $\hat{P}_f$,
$D_f = P_f - \hat{P}_f$, is one index of unexpected differential item
performance. If there is no unexpected differential item performance, $D_f$
should equal zero. A positive $D_f$ indicates that the study group exceeds its
expected performance, while a negative $D_f$ indicates that the item is harder
than expected for the study group. Since $D_f$ is a signed index, it is
insensitive to crossovers in the conditional success distributions of the base
and study groups. An unsigned discrepancy index that can be used with $D_f$ is
the root mean weighted squared difference ($\mathrm{RMWSD}_f$). The
$\mathrm{RMWSD}_f$ for an item is obtained by weighting each difference in
conditional probabilities of successful item performance between the study and
base groups by that difference (which is equivalent to squaring the difference)
and by the frequency of scores in the female standardization group at each scaled
score level, summing this weighted difference across the 61 scaled score levels,
dividing this sum by the number of candidates in the standardization group, and
taking the square root of the result. The mathematical formula for the
$\mathrm{RMWSD}_f$ is

$$\mathrm{RMWSD}_f = \left[ \sum_{s=1}^{S} N_{fs} \left( P_{fs} - \hat{P}_{fs} \right)^2 \Big/ \sum_{s=1}^{S} N_{fs} \right]^{1/2} \tag{2}$$

where S is the number of score levels, $N_{fs}$ is the number of individuals at score
level s in subpopulation f, $P_{fs}$ is the conditional probability of successful
performance in subpopulation f at score level s, and $\hat{P}_{fs}$ is the predicted value
of $P_{fs}$. Note that typically $\hat{P}_{fs} = P_{ms}$, where $P_{ms}$ is the conditional probability
of successful performance observed at score level s in the male base group.
Given the definition of $D_f$ as

$$D_f = \sum_{s=1}^{S} N_{fs} \left( P_{fs} - \hat{P}_{fs} \right) \Big/ \sum_{s=1}^{S} N_{fs}, \tag{3}$$

it can be shown that

$$\mathrm{RMWSD}_f = \left[ D_f^2 + \sum_{s=1}^{S} N_{fs} \left( D_{fs} - D_f \right)^2 \Big/ \sum_{s=1}^{S} N_{fs} \right]^{1/2}, \tag{4}$$
where $D_{fs} = P_{fs} - \hat{P}_{fs}$. Since this index is unsigned, any difference produces a
positive discrepancy. Consequently, every item will have a positive $\mathrm{RMWSD}_f$.
An item exhibiting substantial unexpected differential item performance will
have a large $\mathrm{RMWSD}_f$.

Equation (4) expresses $\mathrm{RMWSD}_f$ as the square root of the sum of two additive
components: the square of a constant directional discrepancy, which is $D_f^2$, and
an index of residual crossover, i.e., a sum of weighted squared differences in
conditional probabilities after adjusting for the constant difference, which is
the second component in (4). While the $D_f^2$ portion is probably systematic and
indicative of unexpected differential item performance, the residual crossover
component may or may not be indicative of systematic unexpected differential
item performance because it does not allow random differences to cancel out. As
such, the primary purpose of the residual crossover component is to flag an item
for closer examination.
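
The indices described above can be sketched in a few lines of Python. The function name, the three score levels, and all counts and probabilities below are hypothetical; the sketch assumes the expected probabilities are the base-group conditional probabilities and that weights are the study-group frequencies. The toy item is a pure crossover, so the signed index cancels to zero while the unsigned index does not.

```python
import math


def standardization_indices(n_f, p_f, p_hat):
    """Return (P_f, P_hat_f, D_f, RMWSD_f, residual crossover component).

    n_f:   study-group frequency at each score level (the weights)
    p_f:   study-group conditional probability at each score level
    p_hat: expected conditional probability (here, the base group's)
    """
    total = sum(n_f.values())
    d = {s: p_f[s] - p_hat[s] for s in n_f}            # score-level differences
    P_f = sum(n_f[s] * p_f[s] for s in n_f) / total     # observed proportion correct
    P_hat = sum(n_f[s] * p_hat[s] for s in n_f) / total # expected proportion correct
    D_f = sum(n_f[s] * d[s] for s in n_f) / total       # signed index
    rmwsd = math.sqrt(sum(n_f[s] * d[s] ** 2 for s in n_f) / total)
    residual = sum(n_f[s] * (d[s] - D_f) ** 2 for s in n_f) / total
    return P_f, P_hat, D_f, rmwsd, residual


# Hypothetical item whose conditional curves cross over.
n_f = {300: 100, 400: 200, 500: 100}
p_f = {300: 0.30, 400: 0.50, 500: 0.80}
p_hat = {300: 0.40, 400: 0.50, 500: 0.70}

P_f, P_hat, D_f, rmwsd, residual = standardization_indices(n_f, p_f, p_hat)
print(round(D_f, 4), round(rmwsd, 4))
# The squared unsigned index splits into D_f squared plus the residual component.
print(math.isclose(rmwsd ** 2, D_f ** 2 + residual))
```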

A problem faced by any investigation which seeks to detect and quantify
unexpected differential item performance, regardless of methodology, is the
determination of what level of unexpected differential item performance should
evoke concern. One could argue that any difference should evoke concern. This,
however, would be an extreme position that ignores the fact that measurement
systems are always contaminated by noise. In the present study, we examined
distributions of root mean weighted squared differences ($\mathrm{RMWSD}_f$) to empirically
determine a cutoff point which defines a substantial amount of unexpected
differential item performance. Examination of these frequency distributions led
us to conclude that an item with an $\mathrm{RMWSD}_f$ greater than or equal to .08 merits
careful investigation, while an item with an $\mathrm{RMWSD}_f$ less than .08 does not
require additional study. As we acquire more experience with applying the
standardization approach to other data bases, a better cutoff may evolve.
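
A rough Python sketch of this kind of inspection follows: it tallies a handful of index values into a text histogram and lists the items at or above the .08 cutoff. The six values and the bin width of .02 are hypothetical choices for display, not figures from the study.

```python
from collections import Counter

CUTOFF = 0.08      # cutoff adopted in the text
BIN_WIDTH = 0.02   # hypothetical bin width, chosen only for display


def to_units(value):
    """Express an index in integer ten-thousandths to avoid float edge cases."""
    return round(value * 10000)


# Hypothetical RMWSD values for six items.
rmwsd_values = [0.0307, 0.0309, 0.0603, 0.0716, 0.0800, 0.1054]

bins = Counter(to_units(v) // to_units(BIN_WIDTH) for v in rmwsd_values)
for b in sorted(bins):
    low = b * BIN_WIDTH
    print(f"{low:.2f}-{low + BIN_WIDTH:.2f} {'*' * bins[b]}")

# Items at or above the cutoff merit careful investigation.
flagged = [v for v in rmwsd_values if to_units(v) >= to_units(CUTOFF)]
print(flagged)
```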

In combination, $D_f$ and $\mathrm{RMWSD}_f$ provide a statistical description of an item
that will enable us to ascertain the degree of unexpected differential item
performance obtained in the female study group.

Test Form and Sample Used in This Study 

SAT Form ZSA5 and TSWE Form E11, administered in December 1977, were used
in this study. Stern (1977) previously described the psychometric properties of
TSWE Form E11; and Cook and Nutkowitz (1979), the psychometric properties of SAT
Form ZSA5. Since the psychometric properties of ZSA5/E11 are described in
detail in the test analysis reports just cited, only the most salient
characteristics are summarized here. Both the verbal and mathematical sections of Form
ZSA5 had fairly typical reliabilities (and scaled score standard errors of
measurement) of .914 (32) and .916 (33), respectively, in a spaced sample of
1,895 candidates from the total group of 166,311 candidates who took Form ZSA5
in December 1977. The mean equated delta, an index of test difficulty described
by Hecht and Swineford (1981) and Walker (1981), for the verbal section was
11.3, which indicated the test was slightly easier than intended. For the
mathematical section, the mean equated delta was 12.4, slightly more difficult
than intended. TSWE Form E11 had a fairly typical reliability of .887 in a
spaced sample of 1,615 candidates from the total group of 84,144 who took TSWE
Form E11 in June 1976. The mean equated delta was 9.3, slightly easier than
intended.
The basic data for this study were the item responses of 21,835 male
candidates and 21,209 female candidates who took the 85 verbal, 60 mathematical
and 50 TSWE items that appeared in the operational sections of the Forms ZSA5
and E11 that were administered in December 1977. The combined sample of 43,044
was representative of the total group that took ZSA5/E11 at that administration.

Procedure

The focus of the present study is on the assessment of unexpected
differential item performance for female candidates on Forms ZSA5 and E11 items. In
this particular application of the general standardization technique, the study
group is the female candidate subpopulation. The standardization group supplies
the standard ability distribution used by the standardization approach. Any
subgroup, including a composite group or a hypothetical group, can be used as the
standardization group. Since the standard ability distribution serves as a
weighting function, it is advisable to use each study group as its own
standardization group, thereby enabling use of a weighting function that mirrors the
relative frequency at each score level in the study group. The male candidate
subpopulation, as the majority group, was chosen as the base group, i.e., the
subpopulation that supplies the model for item performance as a function of
ability. The model is the conditional probability of successful performance on
the item given ability. The largest subpopulation was used as the base group in
order to produce the most statistically stable model of item performance given
test score that can be attained. Table 1 contains the marginal score
distribution for the female study group and male base group for SAT-Verbal,
SAT-Mathematical, and TSWE. Note that the largest weights (relative frequency in
the female study group) tend to be given to scores between 240 and 550 on the
verbal scale, scores between 260 and 540 on the mathematical scale, and scores
between 30 and 60 on the TSWE scale (a relatively large weight is also assigned
to 20).

Table 1

Frequency Distributions and Summary Statistics of Males' and Females' Verbal,
Mathematical, and TSWE Scaled Scores

                     VERBAL                        MATHEMATICAL
Scaled       Male          Female            Male          Female
Score      f  % below     f  % below       f  % below     f  % below
  800      1  100.0      0  100.0       13   99.9      0  100.0
  790      1  100.0      1  100.0       17   99.9      2  100.0
  780      1  100.0      1  100.0       17   99.8      5  100.0
  770      2  100.0      0  100.0       30   99.6      3  100.0
  760      ?  100.0      7  100.0       48   99.4      6   99.9
  750     10   99.9      8   99.9       6?   99.2     15   99.9
  740     14   99.8     25   99.8       92   98.7     15   99.8
  730      ?   99.8     14   99.7      151   98.0     32   99.6
  720     12   99.7      9   99.7      120   97.5     24   99.5
  710     49   99.5     33   99.5      134   96.9     38   99.3
  700     11   99.4     21   99.4      154   96.2     40   99.2
  690     61   99.2     46   99.2      148   95.5     44   98.9
  680     49   98.9     32   99.1      195   94.6     70   98.6
  670      ?   98.5     75   98.7      197   93.7     66   98.3
  660     74   98.2     49   98.5      256   92.5     71   98.0
  650     66   97.9     57   98.2      219   91.5     82   97.6
  640    165   97.1    150   97.5      234   90.5     89   97.2
  630     91   96.7     47   97.3      293   89.1    122   96.6
  620    224   95.7    191   96.4      205   87.7    134   96.0
  610    126   95.1    120   95.8      345   86.1    146   95.3
  600    250   94.0    217   94.8      655   83.1    390   93.4
  590    15?   93.3    122   94.2      429   81.2    221   92.4
  580    36?   91.6    317   92.7      397   79.3    246   91.2
  570      ?   90.7    196   91.8      435   77.4    267   90.0
  560    192   89.8    152   91.1      376   75.6    265   88.7
  550    476   87.6    434   89.0      513   73.3    324   87.2
  540    2?9   86.4    251   87.8     1080   68.3    692   83.9
  530    536   83.9    524   85.4      500   66.0    346   82.3
  520    341   82.3    318   83.9      497   63.8    391   80.5
  510    672   79.3    662   80.8      568   61.2    484   78.2
  500    379   77.5    362   79.1      601   58.4    468   75.9
  490    737   74.2    729   75.6      599   55.7    496   73.5
  480    4?7   72.0    428   73.6     1109   50.6   1314   68.7
  470    48?   69.6    442   71.5      619   47.8    542   66.2
  460    967   65.4    895   67.3      625   44.9    567   63.5
  450    505   63.1    487   65.0      562   42.3    539   61.0
  440   1062   58.1   1055   60.0      514   39.9    545   58.4
  430    5?9   55.5    543   5?.5     1096   34.9   1292   52.3
  420   10?9   50.9    949   53.0      541   32.4    619   49.4
  410    507   4?.2    570   50.3      460   30.3    533   46.9
  400    545   45.7    611   47.4      471   28.2    562   44.2
  390   1016   41.0    972   42.8      483   26.0    657   41.1
  380    553   38.5    567   40.2      494   23.7    603   38.3
  370   1128   33.3   1021   35.3      842   19.8   1035   33.4
  360    554   30.8    511      ?      449   17.8    536   30.9
  350    ?60   26.9    813   29.1
  340    477   24.7    475   26.9
  330    90?   20.5    891   22.7
  320    358   18.9    290   21.3
  310    444   16.8    358   19.6
  300    703   13.6    775   16.0
  290    3?7   11.9    334   14.4
  280    567    9.3    545   11.8
  270    261    8.1    296   10.4
  260    504    5.8    582    7.7
  250    150    5.1    205    6.7
  240    360    3.4    449    4.6
  230    162    2.7    188    3.7
  220    128    2.1    183    2.8
  210    182    1.3    220    1.8
  200    275    0.0    382    0.0

  N   21,835         21,209

[A "?" marks a digit that is illegible in the source. The Mathematical rows for
scaled scores 350 to 200, the TSWE columns, and the Mean and S.D. rows are not
recoverable from the source.]
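
The weighting function described here is simply the relative frequency of the study group at each score level. The short Python sketch below computes weights and the running "% below" figure from the three lowest female verbal frequencies as read from Table 1 (382, 220, and 183 out of 21,209); the variable names are hypothetical, and rounding to one decimal is an assumption about how the table was prepared.

```python
FEMALE_N = 21209  # female study group size reported in the text

# Female verbal frequencies at the three lowest scaled scores, read from Table 1.
freq = {200: 382, 210: 220, 220: 183}

# Weight for each level: relative frequency in the study group.
weights = {s: f / FEMALE_N for s, f in freq.items()}

# "% below": percentage of the group scoring strictly below each level.
percent_below = {}
running = 0
for s in sorted(freq):
    percent_below[s] = round(100 * running / FEMALE_N, 1)
    running += freq[s]

print({s: round(w, 4) for s, w in weights.items()})
print(percent_below)
```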

Results 

SAT Verbal Results 

Table 2 contains listings of four indices described earlier, $P_f$, $\hat{P}_f$, $D_f$,
and $\mathrm{RMWSD}_f$, and the observed percent correct in the male base group, $P_m$, for the
85 verbal items of Form ZSA5. In addition, it includes the means and standard
deviations of these five indices displayed by item type.

The first row of the summary portion of Table 2 contains statistics based
on all 85 verbal items. Note that mean $P_f$ and mean $\hat{P}_f$ are equal to two decimals.
The difference between mean $D_f$ (.00) and mean $\mathrm{RMWSD}_f$ (.05) is attributed to the
fact that $\mathrm{RMWSD}_f$, unlike $D_f$, is an unsigned index of discrepancy that weights
and sums any squared differences between $P_{fs}$ and $\hat{P}_{fs}$ regardless of which value is
larger and thus prevents cancellation of positive and negative differences. On
the other hand, the signed index $D_f$ expresses the amount by which total
differences in one direction exceed total differences in the other direction.

The next row in Table 2 displays the means and standard deviations of
the five indices computed on the vocabulary items only. Again, mean $P_f$ and mean
$\hat{P}_f$ are nearly equal. Both discrepancy indices are also small. The vocabulary
items can be divided still further into antonym items and analogy items. Mean
percent correct on these item types is even less related to scaled scores
than in previous item groupings, and so differences in mean percent correct are
more likely to appear among antonyms or analogies item type groupings than in
the vocabulary items as a whole or the entire verbal test.

Table 2

Listing of Item Difficulty and Discrepancy Indices and
Summary Statistics for Verbal Items from SAT Form ZSA5

                            EST %     DIFF %              % CORRECT
ITEM  ITEM TYPE  % CORRECT  CORRECT   CORRECT   RMWSD    BASE GROUP
  1   ANTONYM     0.9005    0.8607    0.0398    0.0908    0.8702
  2   ANTONYM     0.7143    0.6864    0.0279    0.0406    0.7045
  3   [ANTONYM]   0.?098    0.7932    0.0??4    0.0442    0.8030
  4   [ANTONYM]   0.7062    0.6928    0.0134    0.0293    0.7104
  5   [ANTONYM]   0.7466    0.7?66    0.0600    0.0693    0.7223
  6   ANTONYM     0.?899    0.7401   -0.0511    0.0619    0.7531
  7   [ANTONYM]   0.5125    0.5658   -0.0533    0.0661    0.9777
  8   [ANTONYM]   0.4818    0.4787    0.0031    0.0273    0.4926
  9   [ANTONYM]   0.5796    0.5156    0.0640    0.0697    0.5276
 10   ANTONYM     0.2223    0.1910    0.0313    0.0547    0.1944
 11   [ANTONYM]   0.3449    0.3452   -0.0002    0.0339    0.3531
 12   [ANTONYM]   0.2924    0.2664    0.0261    0.0401    0.2745
 13   ANTONYM     0.3009    0.243?    0.0570    0.0794    0.2517
 14   ANTONYM     0.0786    0.0801   -0.0015    0.0435    0.0801
 15   ANTONYM     0.1383    0.1256    0.0127    0.0524    0.1268
 16   SENT COM    0.7825    0.8373   -0.0547    0.0716    0.8523
 17   [SENT COM]  0.6969    0.7435   -0.0466    0.0603    0.7572
 18   [SENT COM]  0.6951    0.6774    0.0176    0.0309    0.6919
 19   SENT COM    0.6914    0.7?10   -0.0094    0.0307    0.7147
 20   SENT COM    0.4032    0.4955   -0.0923    0.1054    0.5047
 21   READ COM    0.5361    0.5158    0.0203    0.0391    0.5287
 22   READ COM    0.6918    0.6866    0.0052    0.0362    0.6993
 23   READ COM    0.6321    0.6423   -0.0102    0.0330    0.6943
 24   READ COM    0.5616    0.5701   -0.0085    0.0316    0.5801
 25   READ COM    0.2607    0.2360    0.0248    0.0471    0.2420
 26   READ COM    0.0917    0.1075   -0.0158    0.0381    0.1111
 27   READ COM    0.2331    0.2399   -0.0097    0.0428    0.2482
 28   [READ COM]  0.1142    0.1422   -0.0280    0.0503    0.1450
 29   [READ COM]  0.1307    0.1910   -0.0603    0.0760    0.1974
 30   READ COM    0.1568    0.1395    0.0173    0.0389    0.1430
 31   SENT COM    0.8455    0.8345    0.0111    0.0403    0.845?
 32   SENT COM    0.4823    0.5069   -0.0246    0.0409    0.51?2
 33   SENT COM    0.4469    ?         0.0681    0.0788    0.3899
 34   SENT COM    0.3343    ?         0.0104    0.0369    0.3359
 35   SENT COM    0.1308    0.1186    0.0122    ?         0.1255
 36   ANALOGY     0.7601    ?        -0.0752    0.0843    0.8443
 37   ANALOGY     ?         0.7005    0.?429    0.0542    0.7149
 38   ANALOGY     0.6261    0.6401   -0.0140    0.0293    0.6541
 39   ANALOGY     0.5276    0.5117    0.0159    0.0253    0.5274
 40   ANALOGY     0.4669    0.4515    0.0194    0.0343    0.4643
 41   ANALOGY     0.3405    0.3647   -0.0242    0.0381    0.3756
 42   ANALOGY     0.2447    0.2112    0.0334    0.0466    0.2193
 43   ANALOGY     0.1367    0.1656   -0.0289    0.0480    0.1733
 44   ANALOGY     0.1496    0.1979   -0.0483    0.0691    0.2047
 45   ANALOGY     0.0671    0.0884   -0.0212    0.0?34    0.0912
 46   ANTONYM     0.8831    0.8607    0.0229    0.0327    0.8724
 47   ANTONYM     0.8085    0.7684    0.0401    0.0520    0.7823
 48   ANTONYM     0.7160    0.8034   -0.0874    0.0979    0.8173
 49   ANTONYM     0.7037    0.6639    0.0398    0.0489    0.6773
 50   ANTONYM     0.4443    0.4015    0.0428    0.0561    0.4153
 51   ANTONYM     0.4750    0.4631    0.0119    0.0393    0.4760
 52   ANTONYM     0.4575    0.3872    0.0703    0.0832    0.3980
 53   ANTONYM     0.3569    0.3755   -0.0186    0.032?    0.3866
 54   ANTONYM     0.1283    0.1732   -0.0449    0.0714    0.1781
 55   ANTONYM     0.1079    0.1038    0.0041    0.0310    0.1075
 56   SENT COM    0.7701    0.8155   -0.0494    0.060?    0.8299
 57   SENT COM    0.6744    0.6244    0.0?00    0.0661    0.6392
 58   [SENT COM]  0.7046    0.7139   -0.0094    0.0260    0.7303
 59   SENT COM    0.4493    0.4439    0.0493    0.0569    0.4984
 60   [SENT COM]  0.2318    0.2051    0.0267    0.0424    0.2125

[A "?" marks a digit or value that is illegible in the source. Item types shown
in brackets are illegible in the source and are taken from the adjacent rows of
the same block. The continuation of Table 2 (items 61 to 85 and the summary
statistics) is not recoverable from the source.]

The next two rows of Table 2 list the means and standard deviations of
the five indices across the antonyms and analogies item types, respectively.
The values of $\mathrm{RMWSD}_f$ are still of approximately the same size as before. The
magnitudes of $D_f$ are slightly larger than before, yet still small in an
absolute sense.

The statistics for reading, the other section in the verbal test, and the 
two item types that compose it, sentence completion and reading comprehension, 
and the corresponding statistics from their items are posted in the last three 
rows of Table 2. None of these indices exhibit disconcerting amounts of unex- 
pected differential item performance. 

Even if the overall level of unexpected differential item performance
in a set of items is tolerable, there may be some small number of items which
exhibit substantial unexpected differential item performance that is not readily
detectable from the means and standard deviations of discrepancy indices such as
$RMWSD_f$ and $D_f$. For an item level analysis, careful examination of the frequency
distribution of a discrepancy index such as $RMWSD_f$ can be informative. A
combination numerical/pictorial display of the frequency distribution of the $RMWSD_f$
index on all verbal items grouped by subscore and by item type is presented in
Figure 5. The floating histogram in Figure 5 is a clear presentation of the
$RMWSD_f$ index that can be used to identify individual items that exhibit
unusually high amounts of unexpected differential item performance. Note how
the single analogy item with a $RMWSD_f$ of .18 clearly stands out in this figure.

An alternative pictorial representation of the distribution of this index
that conveys even more information is given in Figure 6. In this figure, where
each item type is denoted by a different symbol, the $RMWSD_f$ for an item is
represented by the length of the line from the origin to the point representing
that item. To supply a frame of reference, three arcs of equal $RMWSD_f$ are drawn
on the plot for the values .08, .16 and .24. Items falling within the smallest
arc exhibit a fairly typical amount of $RMWSD_f$. Items falling between the
smallest and middle arc should be examined more closely. Items falling outside
the middle arc are very unusual and clearly exhibit a large amount of unexpected
differential item performance.
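
The screening rule implied by the arcs can be sketched in a few lines of Python. This is only an illustration: the function name and the labels are hypothetical, and the treatment of a value lying exactly on an arc (counted with the outer region) is an assumption, since the text does not address that case.

```python
def classify_by_arc(rmwsd, inner=0.08, middle=0.16):
    """Label an item by where its RMWSD falls relative to the reference arcs.

    The cut points .08 and .16 come from the text; treating a value exactly
    on an arc as belonging to the outer region is an assumption of this sketch.
    """
    if rmwsd < inner:
        return "typical"
    if rmwsd < middle:
        return "examine more closely"
    return "very unusual"


# The analogy item discussed in the text has an RMWSD of .1792.
print(classify_by_arc(0.1792))
```

The third arc at .24 is omitted because the text attaches no separate interpretation to it.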

As described earlier, the $RMWSD_f$ for each item can be expressed as the
square root of the sum of two additive components: the square of a constant
directional discrepancy, which is $D_f$, and an index of residual crossover, i.e.,
a sum of weighted squared differences in conditional probabilities after
correction for the constant difference, which is referred to as the variance of
the weighted differences. (See equation (4).) Projection of each point in Figure 6
on the horizontal axis yields the $D_f$, the difference between $P_f$ and
$\hat{P}_f$, for that item. Projection of that same point on the vertical axis
yields the standard deviation of the weighted differences, the index of residual
crossover. Hence, the location of each point in Figure 6 indicates not only the
degree of unexpected differential item performance ($RMWSD_f$), but also the
extent to which that $RMWSD_f$ is due to a constant difference between the female
and male conditional probability curves (and the direction of that difference:
$D_f$), and the extent to which the item exhibits residual crossover, the height
of the point above the horizontal axis.
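
The decomposition just described can be checked numerically. The sketch below is illustrative only: the function name is hypothetical, the four-level weights and differences are invented for the example, and the weights are assumed to sum to 1.

```python
import math


def decompose_rmwsd(weights, diffs):
    """Split RMWSD into a constant difference and residual crossover.

    weights: relative frequencies of the study group at each score level
             (assumed to sum to 1).
    diffs:   observed minus expected conditional probability at each level.
    """
    # Constant directional discrepancy (horizontal coordinate of the point).
    d = sum(w * x for w, x in zip(weights, diffs))
    # Variance of the weighted differences; its square root is the vertical coordinate.
    var = sum(w * (x - d) ** 2 for w, x in zip(weights, diffs))
    # Root mean weighted squared difference.
    rmwsd = math.sqrt(sum(w * x * x for w, x in zip(weights, diffs)))
    return d, math.sqrt(var), rmwsd


# Hypothetical four-level item, for illustration only.
weights = [0.2, 0.3, 0.3, 0.2]
diffs = [-0.20, -0.15, -0.18, -0.16]
d, sd, rmwsd = decompose_rmwsd(weights, diffs)

# (d, sd) are the coordinates of the plotted point, and RMWSD is its
# distance from the origin.
assert math.isclose(math.hypot(d, sd), rmwsd)
```

An item with differences of nearly constant size and sign, as in this example, plots close to the horizontal axis; an item whose differences change sign across score levels plots higher.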

The analogy item depicted in Figures 3 and 4 is the only verbal item which
falls outside the second arc of Figure 6. It is also the item in Figure 5 that
is off by itself in the floating histogram at the top, where it has a $RMWSD_f$ of
.1792. Clearly this index indicates a highly undesirable amount of unexpected
differential item performance for this analogy item.

Figure 6

Plot of Root Mean Weighted Squared Differences ($RMWSD_f$) Between
the Conditional Probabilities of Success for Male and Female
Candidates on Verbal Items from SAT Form ZSA5

In Figure 6, the analogy item outside the second arc is just above .05 on 
the vertical axis and at approximately -.17 on the horizontal axis. Hence, this 
item is exhibiting little residual crossover, and a very sizeable amount of 
constant difference. Examination of Figure 4 corroborates these observations. 
This analogy item exhibits a substantial constant amount of unexpected differ- 
ential item performance. 

In contrast to this item, most of the items fall within the first arc, which
indicates that most of the items, 80 out of 85 in fact, exhibit acceptable levels
of unexpected differential item performance. Of the four that fall between the
inner and middle arcs, an antonym item that has a positive $D_f$ and an analogy
item with a negative $D_f$ are close enough to the inner arc to be considered as
exhibiting acceptable levels of unexpected differential item performance. The
remaining two items, a sentence completion item and an antonym, however, merit
some careful examination. Like the analogy item outside the middle arc, these
two items have negative $D_f$ values, which indicate that female candidates
performed more poorly than expected on these items.

On the analogy item that lies outside the middle arc, female candidates
performed far worse than expected: $P_f = .63$ vs. $\hat{P}_f = .80$. Inspection of the
content of this particular analogy item revealed potential content bias against
female candidates, as it required some knowledge of hunting and fishing, two
traditionally male-oriented recreational activities.

On the sentence completion item, female candidates performed somewhat
lower than expected: $P_f = .40$ vs. $\hat{P}_f = .50$. Inspection of this item itself
revealed that the subject matter of the item, nuclear power politics, might be
something that males traditionally have shown more interest in than females. It
is not apparent, however, why this particular subject matter should affect
the performance of female candidates on this item.

Finally, on the antonym item, female candidates performed below expectation:
$P_f = .72$ vs. $\hat{P}_f = .80$. Examination of item content, however, provided no
plausible explanation for this difference.

In sum, this analysis of the 85 verbal items on Form ZSA5 uncovered only
one item that exhibited a substantial amount of unexpected differential item
performance that probably could be attributed to item content. Only two other
items exhibited enough unexpected differential item performance to merit
examination. Most of the 85 verbal items exhibited little unexpected differential
item performance for female candidates.

SAT-Mathematical Results

Table 3 contains listings of the five indices, $P_f$, $P_m$, $D_f$, $RMWSD_f$, and
$\hat{P}_f$, for the 60 mathematics items on Form ZSA5. In addition, these indices are
summarized by item type at the bottom of this table. The first row at the
bottom of Table 3 contains means and standard deviations based on 59 mathematics
items. One math item was excluded from this analysis because the percent of
female candidates responding correctly to the item was less than .05.

Unlike verbal test results, mean $P_f$ (.42) for female candidates and mean
$P_m$ (.51) for male candidates are very different, reflective of the difference
between the mathematical ability distributions for males and females, and
illustrative of the need to correct for this difference prior to comparing
male and female item performance. Note that mean $\hat{P}_f$ (.42), in contrast to
mean $P_m$, is very close to mean $P_f$, demonstrating the effectiveness of the
standardization procedure in this regard. Both $D_f$ and $RMWSD_f$ have very low
means, indicating little overall difference, as expected, between the sexes on
the items.

Table 3

Listing of Item Difficulty and Discrepancy Indices and Summary
Statistics for Mathematical Items from SAT Form ZSA5

The next row of Table 3 displays the means and standard deviations of the
five indices computed on the 20 quantitative comparison items. Female
candidates' mean percent correct is extremely close to their estimated mean (i.e.,
mean $D_f$ = .00). The mean value of $RMWSD_f$ is only .04.

The last row of Table 3 presents the data for 39 regular math type items.
Item #60 was excluded from the analysis because the female candidates' percent
correct on this item was less than .05. These means and standard deviations
suggest that little unexpected differential item performance is present.

Figures 7 and 8 contain pictorial and numerical displays of the discrepancy
indices for both quantitative comparison and regular mathematics item types.
Neither the floating histogram in Figure 7 nor the plot in Figure 8 reveals
any items that exhibit the substantial degree of unexpected differential item
performance observed for the one analogy item in the verbal test. Only two
items, in fact, fall outside the inner arc in Figure 8. Female candidates
performed better than expected on one item, but more poorly than expected on the
other item. The plots of male and female conditional percents correct and the
difference plot for the former item are given in Figures 9 and 10, respectively,
while Figures 11 and 12 are the corresponding plots for the latter item. Note
that Figures 9 and 11 appear to be mirror images of each other, with female
candidates slightly exceeding male candidates in Figure 9, while the reverse
occurs in Figure 11. Both figures exhibit fairly constant differences, but
in opposite directions. Examination of the content of these two items provided
no apparent explanation for these differences. Hence, it appears that all
mathematics items on Form ZSA5 are relatively free from unexpected differential
item performance for females, despite the fact that the mean scaled
score for female candidates was approximately one-half a standard deviation
lower than the male candidate mean scaled score. The standardization procedure
effectively adjusted for this difference in overall performance.

Figure 7

Numerical and Pictorial Display of Frequencies of Root Mean Weighted Squared
Differences (RMWSD) Between the Conditional Probabilities of Success
for Female and Male Candidates on Mathematical Items from Form ZSA5
Administered in December 1977

Figure 8

Plot of Root Mean Weighted Squared Differences ($RMWSD_f$) Between
the Conditional Probabilities of Success for Male and Female
Candidates on Mathematics Items from SAT Form ZSA5

*$RMWSD_f$ equals the distance from the origin to the point representing the item.
Projection of each point on the horizontal axis yields the difference between
$P_f$ and $\hat{P}_f$, $D_f$, for that item. Projection of each point on the vertical axis
yields the standard deviation of the weighted differences, an index of residual
crossover.

Figures 9 and 11

Conditional Probability of Successful Performance for Both
Males and Females on Two Math Items from SAT Form ZSA5

Figures 10 and 12

Difference Plot of Two Math Items from SAT Form ZSA5

TSWE Total Test and Item Type Results 

Table 4 contains a listing of the five indices, $P_f$, $P_m$, $D_f$, $RMWSD_f$, and
$\hat{P}_f$, discussed in preceding sections, for the 50 TSWE items on Form E11. In
addition, these indices are summarized by item type at the bottom of the table.
The first row at the bottom of Table 4 contains means and standard deviations
based on all 50 TSWE items, and the next two rows contain the same information
for the 35 usage type items and the 15 sentence correction items, respectively.
Estimated percent correct ($\hat{P}_f$) means for the female candidates are very
close to actual ($P_f$) means across both item types combined and separately. The
mean values of $RMWSD_f$ are similar to those observed for the mathematical items.
No mean differences appear large enough to warrant further consideration.

Figures 13 and 14 contain pictorial and numerical displays of the discrepancy
indices for all TSWE items on Form E11. Inspection of these figures
reveals that only two usage items exhibit any substantial amounts of unexpected
differential item performance. Performance on these items is depicted in
greater detail in Figures 15-18. The female candidates performed better than
expected ($P_f = .59$ vs. $\hat{P}_f = .50$) on the item displayed in Figures 15 and 16.

Table 4

Listing of Item Difficulty and Discrepancy Indices and
Summary Statistics for TSWE Items from Form E11

Figure 13

Numerical and Pictorial Display of Frequencies of Root Mean Weighted Squared
Differences (RMWSD) Between the Conditional Probabilities of Success
for Female and Male Candidates on TSWE Items from Form ZSA5/E11
Administered in December 1977

Figure 14

Plot of Root Mean Weighted Squared Differences ($RMWSD_f$) Between
the Conditional Probabilities of Success for Male and Female
Candidates on TSWE Items from Form ZSA5/E11

Figures 15 and 17

Conditional Probability of Successful Item Performance
for Both Males and Females on Two
TSWE Items from Form ZSA5/E11

Most of this difference is constant across levels of scaled score. On the item
displayed in Figures 17 and 18, the female candidates did not perform as well as
expected ($P_f = .51$ vs. $\hat{P}_f = .59$). Again, most of the difference is in one
direction. Note that these two items appear to cancel each other out.

Examination of the content of these two items revealed that the item on 
which females performed better than expected concerns a woman in a professional 
occupation, while the item on which females fell short of expectation deals 
with World War II, which is generally considered an area that males study 
more than females. However, these content differences do not appear to be 
sufficient explanations for the discrepancies in the observed and expected 
performance of female candidates on these items. 


Summary 

This report was the first in a series of investigations seeking to uncover
evidence relating to the presence or absence of unexpected differential item
performance on operational SAT/TSWE items across different candidate
subpopulations of the SAT/TSWE test-taking population via the statistical method of
standardization. The use of standardization enables one to control for
differences in subpopulation ability. Standardization is a reasonable procedure for
controlling for differences in ability, provided the control variable is a
reasonable measure of ability, as is total scaled score.

Examination of summary statistics for discrepancy indices at the item type
level revealed that there was little evidence of systematic unexpected
differential item performance on either the SAT-M or TSWE tests. On the verbal test,
the analogy items exhibited a mean $D_f$ which suggested systematic unexpected
differential item performance that favored the male candidates. Elimination of
the one analogy item which exhibited very substantial unexpected differential
item performance reduces the mean $D_f$ for analogy items by half, i.e., from -.02
when that item is included in the set to -.01, suggesting that, with the exception
of that one item, the analogy items, as a set, exhibit little unexpected
differential item performance.

In contrast to previous investigations of item fairness (see review by
Dorans, 1982), this investigation of differential item performance identified
very few items out of a total of 195 items as needing careful review for
possible content bias. Of these only one exhibited a clearly unacceptable
degree of unexpected differential item performance that could be attributed
to content bias.

Since this is the first application of the standardization approach to
studies of unexpected differential item performance, future applications are
bound to involve modifications of the method as employed here. Certain
modifications are very likely to occur. For example, different candidate
subpopulations will be studied and, as a consequence, the range of scaled scores studied
may be curtailed. A variation of the standardization procedure that can be used
with small samples may be employed. For some studies, the focus may be shifted
away from breakdowns by item type towards breakdowns by content, where feasible.
In short, the methodology will be refined and adapted to meet the requirements
of future applications.

Appendix

THE STANDARDIZATION APPROACH TO ASSESSING
UNEXPECTED DIFFERENTIAL ITEM PERFORMANCE

Since the standardization approach to assessing unexpected differential
item performance represents a new application of an old technique to an important
concern in applied testing, the approach will be presented in detail in this
appendix. First, the rationale for standardization will be discussed. Then,
the particular application of standardization will be described. In the process
of describing this approach to assessing unexpected differential item performance
several terms and concepts will be defined. The goals of this appendix are:

(1) to convey the simplicity and generality of the standardization approach,
and

(2) to illustrate its application to the assessment of unexpected
differential item performance.

The Need for Standardization 

Standardization is a statistical technique that enables one to compare two 
populations of individuals with respect to some variable of interest while 
controlling for differences on some other variable that is related to the vari- 
able of interest. The best way to convey the meaning and importance of standard- 
ization is to illustrate what may occur when standardization is not performed 
when it should be. Simpson's paradox is the designation for a paradoxical 
situation in which a population with a higher overall incidence of some variable 
than a second population actually has a lower incidence of that variable 
than the second population when comparisons of that variable are conditioned on 
some other variable. Simpson's paradox (Wagner, 1982) can be used to illustrate
the importance of standardization. 

Consider the following illustration. Table 1 contains a statistical
description of the performance of two hypothetical groups, A and B, on an
item. Group A is composed of 100,000 candidates, while Group B is composed of
1,000 candidates. In the body of the table, the performance of the two groups on
the item is summarized at the far right under the column heading overall
performance. Here we note that 60,000 of the 100,000 members of Group A answered the
item correctly, while 500 of the 1,000 members of Group B answered the item
correctly. Since the 60% for Group A exceeds the 50% for Group B, we might
conclude that this particular item favors Group A over Group B. Such an
interpretation, however, would be in error because it ignores important information
about the two groups that is contained in the rest of the table, namely that
Group A is more able than Group B.

To the left of the overall performance column in Table 1 are five columns
of numbers that describe the performance on the item of subgroups of A and B
that are classified into five mutually exclusive performance levels, L1-L5. As
is evident in the % Correct rows of the table, L1 is the least able subgroup, L5
is the most able, and L2, L3 and L4 are ordered from low to high in terms of
performance on the control variable. At each ability level, members of Group A
are as able as members of Group B. Thus, the 35,000 members of Group A at L4 are
as able as the 150 members of Group B at L4.

The numbers in the first and fifth rows of the table identify the number
of individuals in Groups A and B, respectively, at each of the performance levels.
These numbers inform us that overall Group A is more able than Group B, with most
of Group A at levels L4 and L5 and most of Group B at L2 and L3. This substantial
difference in overall ability between Groups A and B affects the summary
information portrayed in the overall performance column, which had led us to conclude
that the item favored Group A over Group B.

Table 1

Performance of Two Groups of Different Ability
on an Item that Favors the Lower Ability Group

                                   Ability Level
                                                                 Overall
                          L1      L2      L3      L4      L5   Performance

Group A
  No. of Individuals    5000   15000   25000   35000   20000      100000
  % at Level             .05     .15     .25     .35     .20
  Answer Correct         500    4500   12500   24500   18000       60000
  % Correct               .1      .3      .5      .7      .9          .6

Group B
  No. of Individuals     200     350     250     150      50        1000
  % at Level             .20     .35     .25     .15     .05
  Answer Correct          40     140     150     120      50         500
  % Correct               .2      .4      .6      .8     1.0          .5

A closer examination of all the information in Table 1, however, leads us 
to conclude that the item, in fact, favors Group B over Group A. The evidence 
for this conclusion is contained in the fourth and eighth rows of Table 1, 
which contain the percent correct for each of the five ability levels in groups 
A and B, respectively. Note that at each ability level, a larger percentage of
Group B members answer the item correctly than do Group A members of comparable 
ability. This analysis, conditioned on ability level, indicates that this item 
favors Group B over Group A because the probability of successful performance 
on the item is .1 higher for Group B than Group A at each of the five ability 
levels. Simpson's paradox refers to the fact that the analysis conditioned on 
ability level contradicts the analysis based on a simple comparison of overall 
performance of the two groups on the item, i.e., the analysis based on the data 
in the overall performance column of Table 1. 
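
The reversal can be reproduced directly from the Table 1 counts. The short Python sketch below is offered only as an illustration; the variable names are hypothetical, while the counts are those of Table 1.

```python
# Counts from Table 1: number of candidates and number answering correctly
# at ability levels L1 through L5.
n_a = [5000, 15000, 25000, 35000, 20000]
right_a = [500, 4500, 12500, 24500, 18000]
n_b = [200, 350, 250, 150, 50]
right_b = [40, 140, 150, 120, 50]

# Overall proportion correct ignores ability level.
overall_a = sum(right_a) / sum(n_a)
overall_b = sum(right_b) / sum(n_b)

# Conditional proportion correct at each ability level.
cond_a = [r / n for r, n in zip(right_a, n_a)]
cond_b = [r / n for r, n in zip(right_b, n_b)]

print("overall:", overall_a, overall_b)
for level, (pa, pb) in enumerate(zip(cond_a, cond_b), start=1):
    print(f"L{level}: A={pa:.1f}  B={pb:.1f}")

# Group A leads overall, yet Group B leads at every ability level.
a_leads_overall = overall_a > overall_b
b_leads_at_every_level = all(pb > pa for pa, pb in zip(cond_a, cond_b))
```

Both flags come out true for these counts, which is the paradox: the overall comparison and the comparison conditioned on ability point in opposite directions.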

Standardization with respect to ability level removes the paradox in the 
item performance analyses by producing a simple total group comparison, like 
that based on the overall performance column, which is not confounded by 
differences in group ability. Standardization accomplishes this goal by using 
the same standard ability distribution for both groups. 

Definitions 

In the balance of this appendix, the following definitions will be employed
to designate various subgroups and variables used by the standardization approach
to the assessment of unexpected differential item performance:

Variables. There are two types of variables: study and control. The
study variable is the variable of interest, while the control variable is a
variable that is related to the study variable and which must be controlled
while making comparisons of the study variable. In the example under
consideration, performance on the item expressed as percent correct is the study variable
while ability level is the control variable. Since percent correct is related
to ability level, the latter must be controlled for during comparisons of the
former.

Groups. There are three types of groups: study, standardization, and
base. The study group, as the phrase implies, is the group under study. In
any given investigation, there are as many potential study groups as there are
potential subgroups in a population. In actuality, certain subgroups, e.g.,
Blacks, are more likely to be study groups because of concerns about the
relevance of tests for these subgroups.

The standardization group supplies the ability distributions used by the 
standardization approach. In any comparison of two groups, three possible 
standardization groups immediately suggest themselves: either of the two 
groups or a composite of the two groups. While all three of these groups are 
based on actual data, the standardization approach is not limited to
standardization groups based on actual data. A hypothetical ability distribution
constructed to suit some desiderata could be used as the standardization group.

The base group supplies the model for the data to the standardization
process. The model for the data expresses the study variable as a function of
the control variable. In assessing unexpected differential item performance,
the model is the expected performance on the item conditioned on ability, i.e.,
the expected probability of successful performance on the item given ability
level. As in the case of the study group, there are as many potential base
groups as there are potential subgroups. A subgroup cannot be both the study
group and the base group in the same analysis, however. To achieve a stable
model for data, the base group should be as large as possible. To avoid
part-total group contaminations, the base group should be independent of the various
study groups in an investigation.

In investigations of unexpected differential item performance, the model
for the data can be empirical or theoretical. An example of an empirical
model in an investigation of unexpected differential item performance in a Black
study group would be the conditional percent correct in a white base group. If
an adjustment of percent correct for not reached, omits and number wrong served
as the data in the study group, an empirical model of the data would be the
comparable adjusted percent correct observed in the base group. Further
discussion of adjusted percent correct is reserved for the mathematical
formalization presented later in this appendix.

The various models of item response theory (Lord, 1980) are examples of 
theoretical models for the data. This appendix is limited to empirical models 
for the data. 



Mathematical Formalization

The mathematical formulation of the standardization approach to assessing
unexpected differential item performance can be described in several stages,
each of which focuses on a different component. These components are:

  I. Observed Study Group Data
     A. Basic Data
     B. Derived Data to be Modelled
 II. The Model for the Data
III. Definition of the Standardization Group
 IV. Statistical Indices of Unexpected Differential Item Performance

Observed Study Group Data 

In the balance of this appendix, the following indices will be employed:

- g is the subscript for subgroup and ranges from 1 to G, where G is the
  number of subgroups;

- s is the subscript for scaled score or ability level and ranges from 1 to
  S, where S is the number of scaled score levels. For SAT-V and SAT-M,
  S is 61; for TSWE, S is 41;

- r is a response type indicator for which
    1 = correct response
    2 = incorrect response
    3 = omit
    4 = not reached.

Basic Data. The basic data are counts, $N_{gsr}$, i.e., the number (frequency)
of people in subgroup g at ability level s who gave response type r to the item.
For example, $N_{gs1}$ is the number of people in g at ability level s who responded
correctly to the item, while $N_{gs3}$ is the number of people in g at ability level
s who omitted the item. If we let + represent a simple unweighted sum, then
$N_{gs+}$ is the number of people in g at s. In addition, $N_{gs+} - N_{gs4}$ is the number
of people in g at s who reached the item.

Derived Data to be Modelled. Some variation of percent correct is the
data to be modelled for unexpected differential item performance. Simple
percent correct at ability level s in subgroup g is defined as

(1)  $P_{gs} = N_{gs1} / N_{gs+}$.

An alternative percent correct involves a correction for not reached,

(2)  $P_{gs}(NR) = N_{gs1} / (N_{gs+} - N_{gs4})$.

Yet another "adjusted" percent correct entails an adjustment for guessing,

(3)  $P_{gs}(GA) = \left( N_{gs1} - N_{gs2}/(k-1) \right) / N_{gs+}$,

where k is the number of options in the multiple choice question. Choice of
"percent correct" depends on the purposes of the investigation. Various choices,
such as (1) - (3) above, can be obtained as a special case of a general formula
for the data,

(4)  $P_{gs}(\mathbf{W}_r) = \dfrac{\sum_{r=1}^{4} N_{gsr}\, w_r}{\sum_{r=1}^{4} N_{gsr}\, w^{*}_r}$,

where $w_r$ is the rth element in the vector of weights applied to $N_{gsr}$ to
obtain the numerator of $P_{gs}(\mathbf{W}_r)$, while $w^{*}_r$ is the rth element in the
vector of weights applied to $N_{gsr}$ to obtain the denominator of
$P_{gs}(\mathbf{W}_r)$. For equations (1) to (3) above, the corresponding weight vectors,
$\mathbf{W}_r$ and $\mathbf{W}^{*}_r$, are:

             Numerator weights         Denominator weights
Equation        R, W, O, NR                R, W, O, NR

  (1)        (1, 0, 0, 0)               (1, 1, 1, 1)
  (2)        (1, 0, 0, 0)               (1, 1, 1, 0)
  (3)        (1, -1/(k-1), 0, 0)        (1, 1, 1, 1)

Choice of the numerator and denominator weight vectors for use in (4)
determines the data $P_{gs}$ to be modelled.

In the example in Table 1, simple percent correct, equation (1), was used to
obtain the data to be modelled. Dividing the numbers in the third and seventh
rows by the numbers in the first and fifth rows, respectively, provides the
simple percent corrects contained in the fourth and eighth rows, respectively,
of Table 1. For example, the .5 ($P_{A3}$) for group A at score level L3 is
obtained by dividing 12,500 ($N_{A31}$) by 25,000 ($N_{A3+}$).
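
The general formula in (4) amounts to a ratio of two weighted sums of the four response counts. A minimal Python sketch follows; the function names, the dictionary keys, and the example counts are hypothetical and chosen only for illustration, while the weight vectors are those listed for equations (1) to (3).

```python
def percent_correct(counts, num_weights, den_weights):
    """General percent correct of equation (4).

    counts: (right, wrong, omit, not reached) frequencies for one subgroup
            at one score level.
    """
    numerator = sum(n * w for n, w in zip(counts, num_weights))
    denominator = sum(n * w for n, w in zip(counts, den_weights))
    return numerator / denominator


def weight_vectors(k):
    """Numerator and denominator weights for equations (1) to (3); k = number of options."""
    return {
        "simple": ((1, 0, 0, 0), (1, 1, 1, 1)),               # equation (1)
        "not_reached": ((1, 0, 0, 0), (1, 1, 1, 0)),          # equation (2)
        "guessing": ((1, -1 / (k - 1), 0, 0), (1, 1, 1, 1)),  # equation (3)
    }


# Hypothetical counts at one score level: 60 right, 20 wrong, 10 omit, 10 not reached,
# on a five-option item.
counts = (60, 20, 10, 10)
vectors = weight_vectors(k=5)
results = {name: percent_correct(counts, *w) for name, w in vectors.items()}
print(results)
```

Keeping the weights as data rather than writing three separate functions mirrors the point of equation (4): a different definition of percent correct requires only a different pair of weight vectors.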

The Model for the Data 

The data are defined as the percent correct for the study group. For an
empirical model, the model for the data is simply the same percent correct
for the base group. Both the data and the empirical model for the data
are obtained via equation (4). For the data, the subscript g refers to the
study group. Likewise, for the model, the subscript g refers to the base group.

When the data base is sufficiently large, as is the case with the SAT, it
is often sensible to use the largest subgroup as the base group. In that case
the model for the data can be obtained via a straightforward application of
equation (4). In the hypothetical example depicted in Table 1, the base group
model values for simple percent correct data are simply the observed percent
correct data for group A, which are listed in the fourth row.

Definition of the Standardization Group 

The standardization group supplies the standard ability distributions used
by the standardization approach. Any of the G subgroups can be used as the
standardization group. Since the standard ability distribution serves as a
weighting function, it is advisable to use each study group as its own
standardization group, thereby using a weighting function that mirrors the relative
frequency at each score level in the study group.

Formalizing the role of the standard ability distribution in the
standardization process illustrates how it serves as a weighting function. As the phrase
might imply, "unexpected differential item performance" focuses on unexpected
differences in item performances. Controlling for differences in subgroup
ability through standardization enables us to label as unexpected any difference
between actual and expected item performance. For subgroups composed of equally
able members, there should be no differences in item performance. For the
SAT and TSWE, reported scaled scores are highly reliable measures of the
developed abilities assessed by that testing instrument. It is therefore reasonable
to presume that individuals at the same scaled score ability level across
subgroups should have the same probability of successful performance on the
item. Hence unexpected differential item performance focuses on differences in
item performance at fixed score levels. For SAT-V and SAT-M, there are 61
reported score levels, and for TSWE, there are 41 reported score levels.
Standardization affords us a simple way of summarizing unexpected
differences in each item performance across score levels. For both SAT-V and SAT-M,
it enables us to reduce 61 potential differences to two summary indices without
the confounding effects due to differences in group ability. For TSWE, 41
potential differences are reduced to two summary indices.

Statistical Indices of Unexpected Differential Item Performance 
At each score level s, in group g, we have the difference,

(5)  $D_{gs} = P_{gs} - \hat{P}_{gs}$,

where $P_{gs}$ is the observed data defined in (4) using the study group's
counts, and $\hat{P}_{gs}$ is the model for the data defined via (4) using the
base group's counts. In equation (5), $D_{gs}$ is a conditional difference
between the data and the model.

Let $W_{gs}$ be the standardization group weighting function for study group g.
A sensible weighting function contains the relative frequencies of scaled
score s in study group g, i.e.,

(6)  $W_{gs} = N_{gs+} / N_{g++}$,

where $N_{gs+}$ is the number of individuals in group g at score level s and
$N_{g++}$ is the number of individuals in group g across all S score levels.

Applying each $W_{gs}$ to its corresponding conditional difference and summing
across score levels yields a mean weighted difference,

(7)  $D_g = \sum_{s=1}^{S} W_{gs} D_{gs}$,

an overall difference between the data and the model for percent correct.
This difference is one index of unexpected differential item performance
supplied by standardization with respect to ability. A second index is the
mean weighted squared difference,

(8)  $MWSD_g = \sum_{s=1}^{S} W_{gs} D_{gs}^2$,

which can be rewritten as

(9)  $MWSD_g = \sum_{s=1}^{S} (W_{gs} D_{gs}) D_{gs}$,

which implies that each difference is weighted by itself as well as by the
weighting function associated with the standardization group. The square root
of MWSD is also an index of discrepancy, RMWSD, that is on a scale that is
comparable to $D_g$.
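
As an illustration only, the indices in (5) through (9) can be computed from
per-level counts along the following lines. This is a minimal Python sketch,
not part of the original analysis: the function name standardization_indices,
its argument names, and the two-level counts in the usage lines are all
hypothetical, and the sketch assumes every score level contains at least one
examinee in both the study group and the base group.

```python
from math import sqrt


def standardization_indices(study_correct, study_total, base_correct, base_total):
    """Return (D, MWSD, RMWSD) for one item, using the study group as its
    own standardization group.

    Each argument is a sequence with one entry per score level:
    number answering correctly and number of examinees, per group.
    Assumes no score level is empty in either group (hypothetical helper).
    """
    n_study = sum(study_total)              # N_{g++}
    d = 0.0
    mwsd = 0.0
    for c_s, n_s, c_b, n_b in zip(study_correct, study_total,
                                  base_correct, base_total):
        p_obs = c_s / n_s                   # observed P_{gs}
        p_model = c_b / n_b                 # model for the data, from the base group
        diff = p_obs - p_model              # equation (5)
        w = n_s / n_study                   # equation (6)
        d += w * diff                       # equation (7)
        mwsd += w * diff * diff             # equation (8)
    return d, mwsd, sqrt(mwsd)


# Hypothetical counts for two score levels of 100 examinees each.
d, mwsd, rmwsd = standardization_indices([30, 60], [100, 100],
                                         [40, 50], [100, 100])
print(round(d, 6), round(mwsd, 6), round(rmwsd, 6))
```

With these made-up counts the two conditional differences are -.1 and +.1, so
they cancel in the mean weighted difference while MWSD remains .01. This
suggests why the squared index is worth reporting alongside the first.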

To illustrate the standardization process, let us return to the data in
Table 1. Suppose Group B were the study group, chosen as such because its lower
ability level led critics of testing to believe that test items were biased
against Group B. Since there are 100,000 individuals in Group A, it was chosen
as the base group. Since we are primarily interested in study group B, its
ability distribution supplies us with a natural weighting function. Hence, the
data, model and weighting function are:


Level   $P_{Bs}$            $\hat{P}_{Bs} = P_{As}$     $W_{Bs}$

L1:     40/200  = .20       500/5000    = .10           200/1000 = .20
L2:     140/350 = .40       4500/15000  = .30           350/1000 = .35
L3:     150/250 = .60       12500/25000 = .50           250/1000 = .25
L4:     120/150 = .80       24500/35000 = .70           150/1000 = .15
L5:     50/50   = 1.0       18000/20000 = .90           50/1000  = .05

                                                        $\Sigma$ = 1.0



Note that, as with all weighting functions, $\sum_s W_{Bs} = 1.0$. Using the
information above, we obtain

Level   $P_{Bs}$   $\hat{P}_{Bs} = P_{As}$   $D_{Bs}$   $W_{Bs}$   $W_{Bs} D_{Bs}$   $W_{Bs} D_{Bs}^2$

L1:     .2         .1                        .1         .20        .020              .0020
L2:     .4         .3                        .1         .35        .035              .0035
L3:     .6         .5                        .1         .25        .025              .0025
L4:     .8         .7                        .1         .15        .015              .0015
L5:     1.0        .9                        .1         .05        .005              .0005

$\Sigma$                                                1.0        .1                .01

The summary row above reveals that $D_B = .1$ and $MWSD_B = .01$ when Group A
is the base group. Note that $D_B^2 = MWSD_B$, which indicates that all of the
sum of squared differences is due to the constant difference of .1 observed at
each score level.
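
The arithmetic of this example can be checked with exact fractions. The sketch
below is illustrative only and uses the counts given above for Groups A and B.
The variable names are hypothetical, as is the extra quantity "spread": it is
the weighted variance of the conditional differences about their mean, which
is what remains of MWSD once the squared mean difference is subtracted
(the usual identity that a mean square equals the squared mean plus the
variance).

```python
from fractions import Fraction as F

# (number correct, number of examinees) at score levels L1..L5
study = [(40, 200), (140, 350), (150, 250), (120, 150), (50, 50)]        # Group B
base = [(500, 5000), (4500, 15000), (12500, 25000),
        (24500, 35000), (18000, 20000)]                                   # Group A

n_study = sum(n for _, n in study)                     # 1000 examinees in B
weights = [F(n, n_study) for _, n in study]            # W_{Bs}
diffs = [F(cs, ns) - F(cb, nb)                         # D_{Bs} = P_{Bs} - P_{As}
         for (cs, ns), (cb, nb) in zip(study, base)]

d_b = sum(w * d for w, d in zip(weights, diffs))       # mean weighted difference
mwsd_b = sum(w * d * d for w, d in zip(weights, diffs))
spread = sum(w * (d - d_b) ** 2 for w, d in zip(weights, diffs))

print(d_b, mwsd_b, spread)   # prints: 1/10 1/100 0
```

Because the difference is .1 at every score level, the spread term is zero and
MWSD reduces to the square of the mean weighted difference.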

Contrasting Standardization With Other Approaches

The assessment of unexpected differential item performance is an important 
concern in applied testing. As such it has attracted much attention, e.g., 
Berk's (1982) Handbook of Methods for Detecting Test Bias. From the title of
Berk's volume one might infer that several methods for bias detection exist, and 
the contents of the volume confirm this inference. The intent of this closing 
section is to place the standardization approach within the context of the 
methods included in the Berk volume. 

Scheuneman (1981) makes a distinction between two general types of item
bias definitions: definitions related to an item-by-group interaction, e.g.,
Angoff's (Angoff and Ford, 1973) transformed item difficulty approach, and
definitions that involve conditioning on ability, e.g., item response theory
approaches (Lord, 1980). Unexpected differential item performance is clearly a
definition involving conditioning on ability. The standardization approach to
assessing unexpected differential item performance is most akin to item response
theory methods.

In item response theory approaches, parameterized item-ability regressions,
or item response functions, for different subgroups are computed and compared.
In the standardization approach, unparameterized item-test regressions are
compared. While the parametric item response methods are more elegant, the
particular model chosen (e.g., one-parameter) may not fit the data, and the
lack of fit might be misconstrued as bias. In contrast, unparameterized
item-test regressions will not suffer from model fit problems. Like any method
that uses an internal criterion, however, unparameterized item-test regressions
are subject to bothersome item-total contaminations.

While the standardization approach is more akin to parametric item response
theory methods, it shares some of the simplicity of the transformed item
difficulty or delta-plot method. It too results in "transformed" item
difficulties, namely the predicted p-values obtained from applying the marginal
ability distribution of the standardization group to the base group conditional
item success curves. These predicted p-values are the item difficulties one
would expect if both the base group and the study group had ability
distributions like that of the standardization group. These predicted
difficulties should be identical because ability has been directly controlled
for through standardization. Any substantial deviation from identity could be
construed as evidence of unexpected differential item performance, evidence
stated in the simple metric of proportion answering an item correctly.
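
The predicted p-values described here can be sketched with the Group A and
Group B figures from the earlier worked example. The Python fragment below is
illustrative only; the helper name standardized_p and the variable names are
hypothetical, and the base group's own weights are derived from its counts at
each score level (5,000, 15,000, 25,000, 35,000 and 20,000 out of 100,000).

```python
def standardized_p(weights, conditional_p):
    """Apply a marginal ability distribution to conditional item success rates."""
    return sum(w * p for w, p in zip(weights, conditional_p))


w_study = [0.20, 0.35, 0.25, 0.15, 0.05]   # Group B (standardization group) weights
w_base = [0.05, 0.15, 0.25, 0.35, 0.20]    # Group A's own marginal distribution
p_study = [0.2, 0.4, 0.6, 0.8, 1.0]        # Group B proportion correct by level
p_base = [0.1, 0.3, 0.5, 0.7, 0.9]         # Group A proportion correct by level

# Item difficulties expected if both groups had Group B's ability distribution
p_std_study = standardized_p(w_study, p_study)
p_std_base = standardized_p(w_study, p_base)

# Group A's unstandardized p-value, for contrast
p_raw_base = standardized_p(w_base, p_base)

print(round(p_std_study, 3), round(p_std_base, 3), round(p_raw_base, 3))
```

Under these figures the standardized p-values are about .50 for Group B and
.40 for Group A, so their difference matches the mean weighted difference of
.1. Group A's unstandardized p-value is about .60, which shows how a raw
comparison of the two groups would mix the ability difference into the item
comparison.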